Abstract- This paper explores a VLSI architecture for geometrical mapping address
computation. The geometric transformation is reviewed in the framework of plane
projective geometry, which yields a set of basic transformations to be implemented for
general image processing. Homogeneous and 2-dimensional Cartesian coordinates are
employed to represent the transformations, each of which is implemented via an augmented
CORDIC as a processing element. A specific scheme for the processor, utilizing full
pipelining at the macro-level, parallel constant-factor-redundant arithmetic, and full
pipelining at the micro-level, is assessed for producing a single-chip VLSI for HDTV
applications under the current state-of-the-art MOS technology.


1 Introduction 

Geometrical transformations are widely discussed in the field of digital image processing,
in applications such as high-definition television (HDTV), image recognition, interactive
computer graphics, and vision processing [1,2,3]. The primary interest of these
transformations is to project an image into a different domain in order to extract
additional signals conveying information about the image. Moreover, they afford
value-added images over conventional display via high resolution, high definition, and
flexible framing. Consequently, geometrical mapping processors that support real-time
processing are beginning to appear. In recent years, several geometrical mapping
processing modules have been developed and applied successfully to appropriate
applications. They are implemented either by a popular graphics package or application
software accompanied by an acceleration box [5], or by a VLSI processor [6]. We are
interested in a VLSI implementation of a processor that achieves real-time speed for TV
image processing, with a set of transformations sufficient to make a value-added display.

Two barriers are known to have stood in the way of developing such a processor. The first
is the lack of a sufficiently high-speed arithmetic computation technique to generate the
mathematical functions required for geometrical mapping. The second is the need for an
extensive library of geometrical mapping functions. To overcome these, two key techniques
have been developed in [4,6]: the first is a very high speed radix-2 signed-digit adder
and the second is a pipelined micro-programmable arithmetic function generator. In this
paper, we study the same problem with the goal of optimizing the overall functionality and
performance. We achieve this goal by improving the basic cell.

In the following section, we review the requirements of the geometrical mapping processor
by introducing its definition and applications. In Section 3, we study various CORDIC
schemes to implement a basic cell, which can be used to compose the function set necessary
for the geometric transformations.


2 Geometrical Mapper 

Transformation of a sub-image requires a mapping of the sub-image from one point to the
transformed point, pixel by pixel. To rearrange the image, it is necessary to calculate
the destination address of each pixel; the unit that performs this calculation is called
a geometrical mapper.

In the field of plane projective geometry, a transformation from one point to another is
represented as a multiplication in homogeneous coordinates [10]. Let a 2-dimensional (2-D)
point $p_x = (x, y)$ be represented as $(ax, ay, a)$ in right-handed homogeneous
coordinates, with a non-zero constant $a$. The vector $p_x$ is referenced to the origin
$(0, 0)$. The most useful transformations are translation, rotation and scaling, examples
of which are respectively defined as:

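Trans(x, d) : translating $p_x$ to $(x + d, y)$
Rot(x, $\theta$) : rotating the vector $p_x$ by an angle of $\theta$ about X-axis
Scale(x, c) : scaling the vector $p_x$ by $c$ along x-axis.

$$
\begin{aligned}
(x, y) \cdot \mathrm{Trans}(x, d) &= (x + d,\; y) \\
(x, y) \cdot \mathrm{Rot}(x, \theta) &= (x\cos\theta - y\sin\theta,\; x\sin\theta + y\cos\theta) \\
(x, y) \cdot \mathrm{Scale}(x, c) &= (cx,\; y)
\end{aligned}
\qquad (1)
$$

Or, the composite of 3 different transformations in 2-D is represented by

$$
T = \begin{bmatrix}
c\cos\theta & \sin\theta & 0 \\
-c\sin\theta & \cos\theta & 0 \\
cd & 0 & 1
\end{bmatrix},
\qquad (2)
$$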


which is called an affine transformation. The affine transformation is performed via a
set of multiplications and trigonometric functions.

As is easily observed, the affine transformation is necessary to map a sub-image into
another area of the image domain with sliding, re-sizing and proper rotation. Its
immediate applications include sub-image generation for multiple picture-in-picture (PIP)
TV and image template generation for recognition and vision/graphics processing.
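
As an illustration of the composite matrix of eq. (2), the following minimal Python sketch
builds the three elementary transformations in the row-vector convention of eq. (1) and
multiplies them in the order rotation, translation, scaling, which is the order that
reproduces the entries of $T$. The helper names (`rot`, `trans_x`, `scale_x`, `affine`,
`apply_map`) are hypothetical and chosen only for this example.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot(theta):
    """Rotation acting on row vectors: (x, y) -> (x cos - y sin, x sin + y cos)."""
    co, si = math.cos(theta), math.sin(theta)
    return [[co, si, 0.0], [-si, co, 0.0], [0.0, 0.0, 1.0]]

def trans_x(d):
    """Translation by d along the x-axis."""
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [d, 0.0, 1.0]]

def scale_x(c):
    """Scaling by c along the x-axis."""
    return [[c, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def affine(theta, d, c):
    """Composite T = Rot * Trans * Scale (applied left to right on a row vector)."""
    return matmul3(matmul3(rot(theta), trans_x(d)), scale_x(c))

def apply_map(point, m):
    """Map (x, y) through m using the homogeneous row vector (x, y, 1)."""
    row = (point[0], point[1], 1.0)
    u, v, w = (sum(row[k] * m[k][j] for k in range(3)) for j in range(3))
    return (u / w, v / w)

if __name__ == "__main__":
    # Rotate (1, 0) by 90 degrees, slide by d = 2, then stretch x by c = 3.
    print(apply_map((1.0, 0.0), affine(math.pi / 2, 2.0, 3.0)))
```

Applying the elementary matrices one after another to a row vector gives the same result
as applying the single composite matrix, which is why one address computation per pixel
suffices.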

A more sophisticated transformation useful for general image processing is the spherical
transformation, which basically transforms between plane and sphere surfaces. A spherical
transformation from $p_x$ to $q_x = (u, v)$ can be represented by using a set of
elementary functions, such as square root, division, and squaring operations:

$$
u = \frac{rx}{\sqrt{r^2 - x^2 - y^2}}, \qquad
v = \frac{ry}{\sqrt{r^2 - x^2 - y^2}},
\qquad (3)
$$


where $r$ denotes the degree of curvature of the sphere surface. A conventional way to
implement the transformations starts from a software package, i.e., an interactive
graphics package. To implement dedicated hardware, possibly a set of modular structures in
VLSI, it is necessary to identify a basic cell for those functions, and there have been
two different approaches: the first is based on a set of elementary function generators
and the second on a programmable module. For the first approach, fast function generators
are necessary and the performance is limited by the slowest function generator. The
trigonometric functions are apparently the bottleneck when implemented via the first
approach. To optimize the trigonometric function generation while preserving the
regularity of its structure, CORDIC has been suggested. However, the recursiveness of the
CORDIC iteration has fostered the misleading notion that the second approach is not
usually better than the first one.
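
As a software reference for the spherical mapping of eq. (3), the sketch below evaluates
$u$ and $v$ directly with a square root and two divisions. It is only an illustrative
model, not the hardware scheme; the function name `spherical_map` and the error handling
for points with $x^2 + y^2 \ge r^2$ are assumptions of this example.

```python
import math

def spherical_map(x, y, r):
    """Map the plane point (x, y) to (u, v) as in eq. (3); needs x^2 + y^2 < r^2."""
    rho2 = x * x + y * y
    if rho2 >= r * r:
        raise ValueError("x^2 + y^2 must be smaller than r^2")
    denom = math.sqrt(r * r - rho2)  # shared square root for both coordinates
    return (r * x / denom, r * y / denom)

if __name__ == "__main__":
    print(spherical_map(3.0, 4.0, 13.0))
```

The shared denominator shows why one square-root unit and two dividers are enough for
both output coordinates.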

Recently, as VLSI technologies evolve, the effectiveness of integration is no longer
measured simply by multiplication complexity but also by communication complexity, which
includes regularity of the structure, simplicity of the design, and localization of the
interfacing. In these senses, CORDIC has been widely reviewed again, and shown to be
appropriate for a number of algorithmic processors. In brief, CORDIC is a set of
recursive algorithms which can be easily programmed to generate a set of elementary
functions via different modes and proper zero-enforcing. It is also capable of
vector-oriented processing.


3 CORDIC Techniques 

In this section, we review CORDIC functions to i) perform a vector transformation and
ii) generate elementary functions. CORDIC comprises three linear recursive equations,
namely the X-, Y- and Z-recurrences. Table 1 summarizes the computing mode, input and
output specifications of the CORDIC functions of our interest. As shown in the table,
these functions are classified into two cases: one which enforces Z[N] to be zero (known
as rotating) and the other which enforces Y[N] to be zero (known as vectoring). We
discuss these cases in the following sections.

3.1 Rotating case 

The vector rotation of $p_x = (X[0], Y[0])$ by the angle $\theta$ can be realized by an
iterative algorithm called CORDIC [12] instead of computing trigonometric functions and
applying matrix multiplication. CORDIC realizes a vector rotation by a partial sum of
micro-angle rotations with a pre-fixed sequence of angles. When the rotation macro-angle
is represented as a sum of decomposed micro-angles, i.e., $\theta = \sum_{k=0}^{n-1} \theta_k$,
the rotated vector $p'_x$ is

$$
p'_x = \prod_{k=0}^{n-1} k_k
\begin{bmatrix} 1 & -\tan\theta_k \\ \tan\theta_k & 1 \end{bmatrix} p_x
\qquad (4)
$$

where $k_k = \cos\theta_k$ is a micro-scale composing a final scale factor, explained
later. Such a specific form of the pre-fixed micro-angle sequence as $\tan^{-1} 2^{-i}$ is
attractive for VLSI implementation since it is composed only of additions, shifts, and an
arctangent lookup table.

Mode       | Input                          | Enforcing | Output
Circular   | $Z[0] = \theta$, (X[0], Y[0])  | Z[N] = 0  | Rotation by $\theta$
Circular   | Z[0] = 0, (X[0], Y[0])         | Y[N] = 0  | $X[N] = \sqrt{X[0]^2 + Y[0]^2}$, $Z[N] = \tan^{-1}(Y[0]/X[0])$
Linear     | Z[0] = 0, (X[0], Y[0])         | Y[N] = 0  | $Z[N] = Y[0]/X[0]$
Hyperbolic | (X[0], Y[0])                   | Y[N] = 0  | $X[N] = \sqrt{X[0]^2 - Y[0]^2}$

Table 1: Available CORDIC Processing

Non-redundant: The micro-iterations of the conventional (hereafter called non-redundant)
CORDIC use the following 3 linear recursive equations [12]:

$$
\begin{aligned}
X[i + 1] &= X[i] + m\,\sigma_i 2^{-i} Y[i] \\
Y[i + 1] &= Y[i] - \sigma_i 2^{-i} X[i] \\
Z[i + 1] &= Z[i] - \sigma_i \tan^{-1} 2^{-i}
\end{aligned}
\qquad (5)
$$


where $m$ is set to 1 for the circular CORDIC, 0 for the linear, and $-1$ for the
hyperbolic. With an initial value of $Z[0] = \theta$, CORDIC rotates the initial values
$X[0]$ and $Y[0]$ to the final values $X[n]$ and $Y[n]$ while bringing $Z[i]$ closer to
zero in each iteration $i$, so that $Z[n]$ is forced to be zero. With $n$ iterations,
$n$-bit accuracy of $X[n]$ and $Y[n]$ can be achieved. For a known angle, the direction of
the rotation, $\sigma_i$, can be pre-computed or calculated one by one on-the-fly using
the following selection function.

$$
\sigma_i = \begin{cases} 1 & \text{if } Z[i] \ge 0 \\ -1 & \text{if } Z[i] < 0 \end{cases}
\qquad (6)
$$


The CORDIC rotation does not preserve the input norm. To get a rotated vector having the
same length as the input $(X[0], Y[0])$, $X[n]$ ($Y[n]$) needs to be compensated by a
scaling factor $K$,

$$
K = \frac{\| [X[n], Y[n]]^T \|}{\| [X[0], Y[0]]^T \|}
  = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2 2^{-2i}}
\qquad (7)
$$

where $\| \cdot \|$ stands for the norm of the vector. Note that $K$ is constant for the
non-redundant scheme since $\sigma_i$ is in {-1, 1}.
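
The non-redundant rotating mode and the scale compensation can be sketched in floating
point as follows. This is a behavioural model under stated assumptions, not the hardware
design: the function name `cordic_rotate` is hypothetical, the recurrences are taken in
the form $X[i+1] = X[i] + \sigma_i 2^{-i} Y[i]$, $Y[i+1] = Y[i] - \sigma_i 2^{-i} X[i]$,
$Z[i+1] = Z[i] - \sigma_i \tan^{-1} 2^{-i}$ with $m = 1$, and with that sign convention a
positive $Z[0]$ turns the vector clockwise.

```python
import math

def cordic_rotate(x0, y0, theta, n=24):
    """Circular CORDIC, rotating mode: drive Z to zero, then divide out K."""
    x, y, z = x0, y0, theta
    k = 1.0
    for i in range(n):
        sigma = 1 if z >= 0 else -1             # selection function on Z[i]
        t = sigma * 2.0 ** -i                   # a shift in hardware
        x, y = x + t * y, y - t * x             # X- and Y-recurrences
        z -= sigma * math.atan(2.0 ** -i)       # Z-recurrence (lookup table)
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # constant because sigma is +/-1
    return x / k, y / k, z

if __name__ == "__main__":
    print(cordic_rotate(1.0, 0.0, 0.5))
```

The angle must lie within the convergence range of the micro-angle sequence (the sum of
all $\tan^{-1} 2^{-i}$, roughly 1.74 rad); larger angles would need a prior quadrant
reduction, which is not shown here.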


Redundant: Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$ due to its
recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must be
finished before the next micro-rotation is processed. The delay of a macro-rotation
($n$ micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant
arithmetic (carry-free addition such as carry-save or signed-digit addition) and
determining the direction of the rotation $d_i$ from an estimate instead of an exact
value [14]. Redundant arithmetic gives an addition delay of $O(1)$ instead of $O(n)$, and
estimating the direction is necessary so as not to erode the advantage of $O(1)$. This
requires modification of the recurrences and the selection function. The redundant CORDIC
scheme produces the output about 4 times faster than the non-redundant one [14]. However,
it introduces additional cost since the scale factor $K$ becomes variable, depending on
the macro-angle, because $d_i$ is allowed to be in {-1, 0, 1}.
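
To make the cost of a variable scale factor concrete, the following sketch evaluates the
product $\prod_i \sqrt{1 + d_i^2 2^{-2i}}$ for a few direction sequences. The function name
`scale_factor` and the example sequences are assumptions of this illustration.

```python
import math

def scale_factor(directions):
    """Product of sqrt(1 + d_i^2 * 2^(-2i)) over the given direction digits."""
    k = 1.0
    for i, d in enumerate(directions):
        k *= math.sqrt(1.0 + (d * d) * 2.0 ** (-2 * i))
    return k

if __name__ == "__main__":
    n = 16
    print(scale_factor([1, -1] * (n // 2)))  # digits in {-1, 1}
    print(scale_factor([-1] * n))            # same value: only d_i^2 matters
    print(scale_factor([0, 1] * (n // 2)))   # zeros change the factor
```

Any sequence restricted to {-1, 1} yields the same factor, whereas each zero digit removes
one term from the product, so the compensation would have to be recomputed per angle.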

Constant-Factor-Redundant: To reduce the implementation cost of redundant CORDIC, it is
desirable to have a constant scale factor by forcing $d_i$ to be in {-1, 1}. However,
since $d_i$ is determined from an estimate, a question of convergence assurance arises. A
scheme appending correcting iteration stages at proper positions was proposed for this
purpose [15]. Following this idea, the number of extra correcting iterations is further
reduced by dividing the micro-iterations (for $i = 0$ to $i = n - 1$) into two groups: one
group where the direction of the rotation is in {-1, 1}, for $i = 0$ to $i = n/2$, and the
other where it is in {-1, 0, 1}, for $i = (n + 1)/2$ to $i = n - 1$. This reduces the
correcting iterations by 50 %, since no correcting iteration is needed for the second half
of the micro-iterations, and we still obtain a constant scale factor $K$, since the value
of $K$ in $n$-bit precision does not depend on the value of $d_i$ for
$(n + 1)/2 \le i \le (n - 1)$. The Z-recurrence can also be modified so that $d_i$ is
determined quickly by looking at a few most significant bits. This new scheme is called
Constant-Factor-Redundant CORDIC (CFR-CORDIC). The modified recurrences and selection
functions for the scheme are described below.

$$
\begin{aligned}
X[i + 1] &= X[i] + d_i 2^{-i} Y[i] \\
Y[i + 1] &= Y[i] - d_i 2^{-i} X[i] \\
U[i + 1] &= 2\,(U[i] - d_i 2^{i} \tan^{-1} 2^{-i})
\end{aligned}
\qquad (8)
$$

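where $U[i]$, which is equal to $2^i Z[i]$, is introduced for implementation simplicity,
and the selection function, evaluated on the estimate $\hat{U}[i]$ of $U[i]$, is given as
follows:

$$
d_i = \begin{cases}
1 & \text{if } \hat{U}[i] > 0, \text{ or } \hat{U}[i] = 0 \text{ and } i \le n/2 \\
0 & \text{if } \hat{U}[i] = 0 \text{ and } i > n/2 \\
-1 & \text{if } \hat{U}[i] < 0
\end{cases}
\qquad (9)
$$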

When $t$ fractional bits are used in the estimate, i.e., $\hat{U}[i]$ is computed using
$t$ fractional bits of the redundant representation of $U[i]$, correcting iterations need
to be included, where the interval between the indexes of correcting iterations should be
less than or equal to $(t - 1)$, up to the last correcting index equal to $n/2$. When the
correction stage is necessary at the $j$th step of micro-iteration,

$$
U^c[j + 1] = U[j + 1] - 2\, d_j^c\, 2^{j} \tan^{-1} 2^{-j}
\qquad (10)
$$

with the direction of the rotation $d_j^c$ determined from the same selection function of
eq. (9), except that it is decided based on $\hat{U}[j + 1]$ instead of $\hat{U}[i]$.
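
The claim that the scale factor stays constant in $n$-bit precision even when zero
directions are allowed for $i > n/2$ can be checked numerically. The sketch below (the
function name `tail_scale_factor` is hypothetical) computes the largest possible
contribution of the second group of micro-iterations and compares its excess over 1 with
$2^{-n}$.

```python
import math

def tail_scale_factor(n):
    """Scale-factor contribution of iterations with i > n/2 when all |d_i| = 1."""
    k = 1.0
    for i in range(n):
        if 2 * i > n:
            k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

if __name__ == "__main__":
    for n in (8, 16, 24, 32):
        excess = tail_scale_factor(n) - 1.0
        # The excess stays below one unit in the n-th fractional bit.
        print(n, excess, excess < 2.0 ** -n)
```

Since choosing zero for some of these directions can only move the factor between 1 and
this bound, the difference is below the $n$-bit resolution.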


3.2 Vectoring case 

While the rotating case affords vector-wise rotation to implement a geometrical mapper,
the vectoring case provides the elementary functions listed in Table 1. The apparent
difference between the vectoring and rotating modes is the zero-enforcing parameter,
which necessitates a different selection function. For the conventional CORDIC, the
recurrence equations are given by

$$
\begin{aligned}
X[i + 1] &= X[i] + \sigma_i 2^{-i} Y[i] \\
Y[i + 1] &= Y[i] - \sigma_i 2^{-i} X[i] \\
Z[i + 1] &= Z[i] + \sigma_i \tan^{-1} 2^{-i}
\end{aligned}
\qquad (11)
$$

with the following selection function.

$$
\sigma_i = \begin{cases} 1 & \text{if } Y[i] \ge 0 \\ -1 & \text{if } Y[i] < 0 \end{cases}
\qquad (12)
$$
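
A floating-point sketch of the conventional vectoring mode is given below. It follows the
recurrences $X[i+1] = X[i] + \sigma_i 2^{-i} Y[i]$, $Y[i+1] = Y[i] - \sigma_i 2^{-i} X[i]$,
$Z[i+1] = Z[i] + \sigma_i \tan^{-1} 2^{-i}$ with $\sigma_i$ chosen from the sign of $Y[i]$;
the function name `cordic_vector` and the assumption $X[0] > 0$ belong to this example
only.

```python
import math

def cordic_vector(x0, y0, n=24):
    """Circular CORDIC, vectoring mode: drive Y to zero (assumes x0 > 0)."""
    x, y, z = x0, y0, 0.0
    k = 1.0
    for i in range(n):
        sigma = 1 if y >= 0 else -1             # selection function on Y[i]
        t = sigma * 2.0 ** -i
        x, y = x + t * y, y - t * x             # X- and Y-recurrences
        z += sigma * math.atan(2.0 ** -i)       # accumulated angle
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # scale factor K
    return x / k, y, z                          # magnitude, residual Y, angle

if __name__ == "__main__":
    print(cordic_vector(3.0, 4.0))
```

The first return value approximates $\sqrt{X[0]^2 + Y[0]^2}$ after scale compensation and
the third approximates $\tan^{-1}(Y[0]/X[0])$, matching the second row of Table 1.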


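The selection function for CFR-CORDIC in vectoring has been developed as shown below.
Let $W[i] = 2^i Y[i]$, in the same manner as for the rotating case; then

$$
\begin{aligned}
X[i + 1] &= X[i] + \sigma_i 2^{-i} Y[i] \\
W[i + 1] &= 2\,(W[i] - \sigma_i X[i]) \\
Z[i + 1] &= Z[i] + \sigma_i \tan^{-1} 2^{-i}
\end{aligned}
\qquad (13)
$$

$$
\sigma_i = \begin{cases}
1 & \text{if } \hat{W}[i] > 0, \text{ or } \hat{W}[i] = 0 \text{ and } i \le n/2 \\
0 & \text{if } \hat{W}[i] = 0 \text{ and } i > n/2 \\
-1 & \text{if } \hat{W}[i] < 0
\end{cases}
\qquad (14)
$$

where $\hat{W}[i]$ denotes the estimate of $W[i]$. Here the correcting stage at the $j$th
step is defined as follows:

$$
W^c[j + 1] = W[j + 1] - 2\,\sigma_j^c X[j + 1]
\qquad (15)
$$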
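
The substitution $W[i] = 2^i Y[i]$ only rescales the Y-recurrence, which can be verified
numerically. In the sketch below (function names `run_y_form` and `run_w_form` are
hypothetical), the X-update of the W form is written as $\sigma_i 2^{-2i} W[i]$, which
equals $\sigma_i 2^{-i} Y[i]$ by the definition of $W[i]$; the direction digits are an
arbitrary example sequence rather than the output of a selection function.

```python
def run_y_form(x0, y0, sigmas):
    """Iterate the X- and Y-recurrences for a given direction sequence."""
    x, y = x0, y0
    for i, s in enumerate(sigmas):
        x, y = x + s * 2.0 ** -i * y, y - s * 2.0 ** -i * x
    return x, y

def run_w_form(x0, y0, sigmas):
    """Iterate the X- and W-recurrences, with W[i] = 2^i * Y[i]."""
    x, w = x0, y0                      # W[0] = 2^0 * Y[0]
    for i, s in enumerate(sigmas):
        x, w = x + s * 2.0 ** (-2 * i) * w, 2.0 * (w - s * x)
    return x, w

if __name__ == "__main__":
    digits = [1, -1, 0, 1, 1, -1, 0, 1]
    xy = run_y_form(1.0, 0.5, digits)
    xw = run_w_form(1.0, 0.5, digits)
    # W after n steps should equal 2^n times Y after n steps.
    print(xy, (xw[0], xw[1] * 2.0 ** -len(digits)))
```

Keeping $W$ instead of $Y$ holds the zero-enforced variable at a fixed scale, so the sign
estimate can inspect the same few most significant digits at every step.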


So far, we have discussed the recursive structures of several CORDIC schemes to implement
the basic PE. The PE, augmented by a translator, necessitates a scaling operation at each
stage, because shuffling of the output at each stage makes continuous accumulation of the
scaling factor too complex to be processed at the final stage. The scaling operation has
been solved either explicitly or implicitly. The explicit way divides the rotated vector
by a constant, which is known in advance for the non-redundant scheme and otherwise has to
be calculated while running the micro-steps of CORDIC [12,14]. The division can be
processed by another CORDIC (in linear mode) or a divider. The implicit approach
reconfigures the sequence of micro-iterations of the CORDIC by adding scaling
micro-iterations, eventually yielding a norm different from that obtained without them.
Scaling micro-iterations in general aim at bringing the adjusted scaling factor into the
form of a power of two, $2^i$, or 1, which can be easily normalized to unit size. Each
scaling micro-iteration can be composed of i) reduction axis-scaling [16], ii) repetition
of vector-scaling, iii) expansion axis-scaling, or combinations thereof. Search methods
for the solution better than the greedy method or the decomposed search [18] remain to be
studied. In summary, explicit scaling almost doubles the system complexity, while implicit
scaling increases it by 25 % for non-redundant CORDIC and about 30 % for redundant CORDIC.
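
The explicit scaling by a linear-mode CORDIC can be sketched as follows. With $m = 0$ the
X-recurrence leaves $X$ unchanged, $Y$ is driven to zero, and $Z$ accumulates the quotient
$Y[0]/X[0]$. The function name `cordic_divide` is hypothetical, and the model assumes
$X[0] > 0$ and $|Y[0]/X[0]| < 2$, the range covered by the shift sequence $2^{-i}$ for
$i = 0, 1, \ldots$

```python
def cordic_divide(y0, x0, n=32):
    """Linear CORDIC, vectoring mode: Z approaches y0 / x0 (assumes x0 > 0)."""
    y, z = y0, 0.0
    for i in range(n):
        sigma = 1 if y >= 0 else -1   # selection function on Y[i]
        y -= sigma * 2.0 ** -i * x0   # Y-recurrence, X stays constant
        z += sigma * 2.0 ** -i        # Z-recurrence with linear "angles"
    return z

if __name__ == "__main__":
    k = 1.6467602                     # circular scale factor, rounded
    print(cordic_divide(1.0, k))      # approximately 1 / K
```

Because this divider needs about as many micro-iterations as the rotation itself, it is
consistent with the observation that explicit scaling almost doubles the complexity.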

3.3 VLSI Scheme 

To maximize the throughput of the geometric processor, the fully spanned architecture is
selected. The affine transformer is a trivial case, which can be implemented by using a
single CORDIC whose micro-iteration is expanded to include an addition. To implement a
spherical transformer, 4 CORDICs are configured: i) a circular square root of
$\sqrt{x^2 + y^2}$, ii) a hyperbolic square root of $\sqrt{r^2 - (\sqrt{x^2 + y^2})^2}$,
and iii) two linear divisions for $u$ and $v$. To get first estimates of the VLSI size, a
typical TV image processing application is
considered: $O(10^{6})$ pixel/image addressing and $O(10^{-1})$ sec screen flashing. For
this case, the number of input bits $b_i$ must address about $\sqrt{\text{pixel number}}$
positions per axis, for which 12 bits are sufficient. To allow possible interpolations
between pixels, $b_i$ is set to 16. Each CORDIC module requires $(b_i + \log_2 b_i)$ steps
of micro-iterations, plus 30% additional iterations for implicit scaling.

For the spherical transformer, using fully spanned 4-CORDIC, the number of transistors is
estimated at about 30K (4 × 6K × 1.3).
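
The arithmetic behind these first estimates can be reproduced with a few lines. The
function names are hypothetical; the 6K transistors per CORDIC module and the 30 %
overhead are the figures quoted above, and rounding the iteration count up to an integer
is an assumption of this sketch.

```python
import math

def micro_iterations(b_i, scaling_percent=30):
    """Return (b_i + log2(b_i), count including implicit-scaling iterations)."""
    base = b_i + math.log2(b_i)
    return base, math.ceil(base * (100 + scaling_percent) / 100)

def transistor_estimate(modules=4, per_module=6000, scaling_percent=30):
    """Transistor count for fully spanned CORDIC modules with scaling overhead."""
    return modules * per_module * (100 + scaling_percent) / 100

if __name__ == "__main__":
    print(micro_iterations(16))    # word length of 16 bits
    print(transistor_estimate())   # four modules at 6K each plus 30 %
```

For $b_i = 16$ this gives 20 basic micro-iterations per module, and the transistor product
evaluates to 31.2K, in line with the rounded figure of about 30K.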